JAGS stands for “Just Another Gibbs Sampler” and is a tool for the analysis of Bayesian hierarchical models using Markov chain Monte Carlo (MCMC) simulation. It is an engine (more precisely, a probabilistic programming language, PPL) for running BUGS models in Unix-based environments, and it allows users to write their own functions, distributions and samplers.
In contrast to its ancestor, BUGS, JAGS has no graphical user interface (GUI). Thus, it must be run from a separate program, such as R. Using JAGS in R requires the rjags and coda packages (plus a working system installation of JAGS itself, to which rjags links).
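A minimal setup sketch (JAGS itself is installed separately from R, from the JAGS project page at https://mcmc-jags.sourceforge.io):
# Install the R interface packages from CRAN (JAGS must already be installed):
install.packages(c("rjags", "coda"))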
In the remainder, several models for which Gibbs sampling with JAGS can be used are introduced as examples.
Do people prefer using the word “data” in the singular or plural?
Data journalists at FiveThirtyEight conducted a poll to address, among other things, this question.
In particular, they asked the respondents which of two example sentences they prefer: one using “data” as a singular noun and one using it as a plural noun.
It turned out that 31 persons out of 35 prefer using data in the singular.
We want to specify a proper model for doing inference on the probability \(p\) that \(y\) out of \(n\) persons prefer one expression or the other.
# Clear the workspace:
rm(list=ls())
# Set the seed:
set.seed(1)
# Set the data:
n = 35
y = 31
# Create the list object:
dataList = list(y = y, n = n)
# Load the packages:
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.2
## Loaded modules: basemod,bugs
library(coda)
# Implement the JAGS code for this model
model_string = textConnection("model{
}")
Thus, we consider the Beta-Binomial model,
\[ \begin{align*} y\mid p & \sim Bin(n,p) \\ p & \sim Be(\alpha, \beta) \end{align*} \]
that was introduced in the first lab, where \(\alpha=3\) and \(\beta=1\). Then:
- write the model specification and wrap it with the textConnection() function;
- compile the model with jags.model and run the burn-in iterations with update, checking for convergence;
- finally, draw samples from the posterior with coda.samples.
Click on the link for the reference and more.
# Write the code here.
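A minimal sketch of one possible solution, assuming the Beta-Binomial model above with \(\alpha=3\) and \(\beta=1\):
# Specify the model: y | p ~ Bin(n, p), p ~ Be(3, 1)
model_string = textConnection("model{
  y ~ dbin(p, n)   # note: dbin takes the probability first, then the size
  p ~ dbeta(3, 1)
}")
# Compile, burn in, and sample:
model = jags.model(model_string, data = dataList, n.chains = 1, n.adapt = 100)
update(model, n.iter = 1000)
samples = coda.samples(model, variable.names = "p", n.iter = 5000)
summary(samples)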
Consider the dataset BostonHousing2 about housing in
Boston, available through the mlbench library, see the link.
The original data comprise 506 observations on 14 variables, with medv being the target variable. The corrected data set adds the following columns: cmedv, the corrected median value of owner-occupied homes, plus the location identifiers town, tract, lon and lat.
In this example we consider a linear regression model where
cmedv is taken as the dependent variable. Similarly, we may
consider crim as the variable of interest.
rm(list=ls())
library(rjags)
library(coda)
# Import the dataset:
library(mlbench)
data(BostonHousing2)
# Clean the data:
bhousing = within(BostonHousing2,rm("medv","tract","town"))
bhousing$chas = as.numeric(bhousing$chas)
# Get the summary statistics:
summary(bhousing)
## lon lat cmedv crim
## Min. :-71.29 Min. :42.03 Min. : 5.00 Min. : 0.00632
## 1st Qu.:-71.09 1st Qu.:42.18 1st Qu.:17.02 1st Qu.: 0.08205
## Median :-71.05 Median :42.22 Median :21.20 Median : 0.25651
## Mean :-71.06 Mean :42.22 Mean :22.53 Mean : 3.61352
## 3rd Qu.:-71.02 3rd Qu.:42.25 3rd Qu.:25.00 3rd Qu.: 3.67708
## Max. :-70.81 Max. :42.38 Max. :50.00 Max. :88.97620
## zn indus chas nox
## Min. : 0.00 Min. : 0.46 Min. :1.000 Min. :0.3850
## 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:1.000 1st Qu.:0.4490
## Median : 0.00 Median : 9.69 Median :1.000 Median :0.5380
## Mean : 11.36 Mean :11.14 Mean :1.069 Mean :0.5547
## 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:1.000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.74 Max. :2.000 Max. :0.8710
## rm age dis rad
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## tax ptratio b lstat
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
We run a preliminary data analysis to check for any missing data and produce the basic summary statistics. Then, we visualize the variables’ distributions with boxplots and analyze the linear relationships among the variables with a correlation matrix.
# Boxplots:
par(mfrow=c(1, 1),pty="m")
boxplot(scale(bhousing),las=3,main="Standardized covariates",cex.axis=0.75)
# Ridgeline plot
library(ggplot2)
library(ggridges)
library(tidyr) # or library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bhousing_long <- as.data.frame(scale(bhousing)) %>% # Convert to long format
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(bhousing_long, aes(x = Value, y = Variable, fill = Variable)) +
geom_density_ridges(scale = 1.2, alpha = 0.7, rel_min_height = 0.01, show.legend = FALSE) +
labs(title = "Ridgeline plot of standardized covariates", x = "Standardized Value", y = "") +
theme_minimal()
## Picking joint bandwidth of 0.208
# Correlation matrix:
par(mfrow=c(1, 1),pty="s")
image(1:ncol(bhousing),1:ncol(bhousing),cor(bhousing),
xlab="",ylab="",main="Correlation between predictors",
axes=FALSE)
axis(1,1:ncol(bhousing),colnames(bhousing),las=2,cex.axis=0.9)
axis(2,1:ncol(bhousing),colnames(bhousing),las=2,cex.axis=0.9)
The boxplots illustrate the presence of several extreme observations in many variables. In general, such observations can be quite insightful and deserve further investigation when they are few.
In this case, since we notice a substantial number of observations located far from the medians, we conclude instead that the variables display strong skewness (e.g. crim and b) or kurtosis (e.g. lon, lat, and rm).
The correlation matrix suggests that a positive linear relationship exists between cmedv and rm, and to a lesser extent with zn and b, and a negative one between cmedv and ptratio. It follows that these may be good predictors of cmedv.
We consider the following linear regression model,
\[ \begin{split} & y_i \sim \mathscr{N}(\alpha+ X_i \beta,\sigma^2) \\ & \alpha \sim \mathscr{N}(0,100) \\ & \beta_j \sim \mathscr{N}(0,\sigma_b^2)\\ & \sigma^{-2} \sim \mathscr{G}a(0.01,0.01) \\ & \sigma_b^{-2} \sim \mathscr{G}a(0.01,0.01) \\ \end{split} \]
which is a Gaussian hierarchical model similar to the Normal–Inverse Gamma model; it differs, however, in that we also put a prior on the coefficients’ prior variance \(\sigma_b^2\). Thus, rather than fixing the prior variance for each \(\beta_j\), the model learns it from the data.
Moreover, when the number of predictors is large or some predictors are weak, the hierarchical prior helps avoid overfitting by pulling the less relevant coefficients toward zero.
Like every PPL, JAGS allows the user either to store the model specification as a separate file created with a text editor or to embed the BUGS model into R. The syntax is very simple and intuitive; see the user guide for help. For the purpose of learning JAGS we will always opt for the inline specification, below called model_string. Finally, we need to pass the dependent variable, the covariates, the number of observations and the number of covariates as declared in the code (in our example Y, X, n and p, respectively), storing them in a list dataList.
# Model settings:
Y = bhousing$cmedv
X = within(bhousing,rm("cmedv","lon","lat"))
p = ncol(X)
n = nrow(X)
# Define the list:
dataList = list(Y=Y,X=X,n=n,p=p)
# Write the model specification:
model_string = textConnection(
"model{
# Likelihood
for(i in 1:n){
Y[i] ~ dnorm(alpha+inprod(X[i,],beta[]),inv.var)
}
# Priors
for(j in 1:p){
beta[j] ~ dnorm(0,inv.var.b)
}
alpha ~ dnorm(0,0.01)
inv.var ~ dgamma(0.01, 0.01)
inv.var.b ~ dgamma(0.01, 0.01)
sigma2 = 1/inv.var
sigma2_b = 1/inv.var.b
}")
Notice that BUGS requires the precision, instead of the variance, to define the normal distribution.
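For instance, dnorm(0, 0.01) in the model string denotes a Normal with precision 0.01, i.e. variance 100, matching \(\alpha \sim \mathscr{N}(0,100)\) above. Beware that R’s own dnorm is instead parameterized by the standard deviation; a quick check of the correspondence:
# BUGS: dnorm(mu, tau) with tau = 1/sigma^2; R: dnorm(x, mean, sd)
tau = 0.01
sigma = sqrt(1/tau)             # sd = 10, i.e. prior variance 100
dnorm(0, mean = 0, sd = sigma)  # density at zero of the BUGS prior dnorm(0, 0.01)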
With little implementation effort we are ready to estimate the model. We can modify the settings, e.g. the burn-in period burn, the total number of iterations n.iter, the number of adaptation iterations n.adapt, the thinning interval thin, and the number of chains to be produced n.chains.
# Set the options:
burn = 5000
n.iter = 10000
n.adapt = 100
thin = 10
n.chains = 1
# Create a `jags` object:
model = jags.model(model_string,data=dataList,n.chains=n.chains,n.adapt=n.adapt)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 506
## Unobserved stochastic nodes: 16
## Total graph size: 8626
##
## Initializing model
# Run the burn-in iterations:
update(model,n.iter=burn)
# Save the output according to the arguments:
samples = coda.samples(model,variable.names=c("alpha","beta","sigma2","sigma2_b"),thin=thin,n.iter=n.iter)
We have saved the output in samples, where the chains
are stored. The details of the model are stored in model.
Using summary we can view the details concerning the
samples that were produced and the empirical mean, standard deviation,
and quantiles of the parameter distributions.
# Traceplot and posterior density:
for(k in 0:7){ plot(samples[,(2*k+1):(2*(k+1))]) }
# Autocorrelation function:
for(k in c(3,7)){ autocorr.plot(samples[,(2*k+1):(2*(k+1))]) }
Through the trace plots we aim to evaluate the mixing of the chain, which relates to the speed at which the sequence of draws converges to, and explores, its stationary distribution, in our case the posterior.
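Besides visual inspection, the coda package offers formal diagnostics; a minimal sketch with Geweke’s test (the Gelman–Rubin diagnostic would require more than one chain):
# Geweke diagnostic: z-scores comparing the means of the first 10% and
# last 50% of the chain; values far outside (-2, 2) suggest non-convergence
geweke.diag(samples)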
# Complete summary:
summary(samples)
##
## Iterations = 5010:15000
## Thinning interval = 10
## Number of chains = 1
## Sample size per chain = 1000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## alpha 21.475200 3.994704 0.1263236 0.8492043
## beta[1] -0.096140 0.032619 0.0010315 0.0011434
## beta[2] 0.049442 0.013430 0.0004247 0.0007022
## beta[3] -0.030388 0.059776 0.0018903 0.0040300
## beta[4] 2.197793 0.821313 0.0259722 0.0507333
## beta[5] -2.777673 2.119303 0.0670182 0.1677565
## beta[6] 3.998891 0.391686 0.0123862 0.0638174
## beta[7] -0.008798 0.012335 0.0003901 0.0008255
## beta[8] -1.206973 0.168374 0.0053244 0.0099976
## beta[9] 0.266540 0.067298 0.0021281 0.0054251
## beta[10] -0.013793 0.003810 0.0001205 0.0004026
## beta[11] -0.730219 0.113638 0.0035935 0.0153701
## beta[12] 0.010804 0.002618 0.0000828 0.0001746
## beta[13] -0.544255 0.047943 0.0015161 0.0032404
## sigma2 22.938104 1.478681 0.0467600 0.0490708
## sigma2_b 3.327412 2.195553 0.0694295 0.1270806
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## alpha 13.922698 18.940488 20.986745 24.0980098 30.309781
## beta[1] -0.159910 -0.117519 -0.097291 -0.0748222 -0.028921
## beta[2] 0.023909 0.040850 0.049204 0.0582174 0.076780
## beta[3] -0.147558 -0.073333 -0.030446 0.0129870 0.079307
## beta[4] 0.601363 1.652936 2.186272 2.7124088 3.856337
## beta[5] -7.587602 -3.862592 -2.447898 -1.3287197 0.435003
## beta[6] 3.293477 3.756530 3.980581 4.2515050 4.786560
## beta[7] -0.032245 -0.017333 -0.009074 -0.0009598 0.015931
## beta[8] -1.547642 -1.321861 -1.202107 -1.0943644 -0.886878
## beta[9] 0.135099 0.217192 0.267912 0.3138517 0.393986
## beta[10] -0.021891 -0.016199 -0.013772 -0.0110038 -0.006553
## beta[11] -0.956156 -0.801443 -0.723691 -0.6608483 -0.517664
## beta[12] 0.006014 0.009054 0.010689 0.0125976 0.015900
## beta[13] -0.640692 -0.575608 -0.542642 -0.5107082 -0.452804
## sigma2 20.116804 21.927836 22.944038 23.8341898 26.094652
## sigma2_b 1.037446 1.856539 2.776988 3.9618484 9.230980
Recall that the naive standard error is the posterior standard deviation (\(SD\)) of a coefficient estimate \(\hat{\beta}_j\) divided by the square root of the number of MCMC samples \(S\):
\[
\text{Naive SE} = \frac{\text{SD}(\hat{\beta}_j)}{\sqrt{S}}.
\]
It ignores the autocorrelation in the chain. The time-series SE is the corrected standard error of the mean, which accounts for the autocorrelation between samples: in practice the posterior standard deviation is divided by the square root of the effective sample size rather than of \(S\).
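As a sketch, both quantities can be reproduced by hand from the stored chain:
# Naive SE vs time-series SE for alpha, computed manually:
draws = as.matrix(samples)[, "alpha"]
S = length(draws)
sd(draws)/sqrt(S)                                  # naive SE
sd(draws)/sqrt(effectiveSize(samples[, "alpha"]))  # autocorrelation-corrected SE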
Indeed, if the samples appear to be autocorrelated over time, we may want to further increase the number of iterations and the burn-in period to reduce the dependence among subsequent draws (at the cost of a longer computation).
The effective sample size (ESS) measures how much information content is lost due to the correlation in the sequence. In fact, we expect the effective sample size to be smaller than the actual number of draws when the autocorrelation is large.
The ESS thus returns the number of independent samples with the same estimation power as the autocorrelated samples obtained from the MCMC. According to its definition, it is desirable to have an ESS that is close to (or occasionally even larger than) the number of retained draws; on the other hand, it shrinks toward zero when the autocorrelation in the chain is substantial.
# Effective sample size
effectiveSize(samples)
## alpha beta[1] beta[2] beta[3] beta[4] beta[5] beta[6] beta[7]
## 22.12813 813.77061 365.76127 220.01175 262.07823 159.59785 37.67018 223.28369
## beta[8] beta[9] beta[10] beta[11] beta[12] beta[13] sigma2 sigma2_b
## 283.63322 153.87850 89.58277 54.66231 224.88185 218.90714 908.03648 298.48962
First, we load the Major League Baseball data from the 1986 and 1987 seasons from the Hitters dataset in the ISLR2 package: a data frame with 322 observations of major league players on the following 20 variables:
- AtBat: Number of times at bat in 1986
- Hits: Number of hits in 1986
- HmRun: Number of home runs in 1986
- Runs: Number of runs in 1986
- RBI: Number of runs batted in in 1986
- Walks: Number of walks in 1986
- Years: Number of years in the major leagues
- CAtBat: Number of times at bat during career
- CHits: Number of hits during career
- CHmRun: Number of home runs during career
- CRuns: Number of runs during career
- CRBI: Number of runs batted in during career
- CWalks: Number of walks during career
- League: A factor with levels A and N indicating player’s league at the end of 1986
- Division: A factor with levels E and W indicating player’s division at the end of 1986
- PutOuts: Number of put outs in 1986
- Assists: Number of assists in 1986
- Errors: Number of errors in 1986
- Salary: 1987 annual salary on opening day (in thousands of dollars)
- NewLeague: A factor with levels A and N indicating player’s league at the beginning of 1987
We are interested in determining which characteristics of a hitter might explain his salary. So the dependent variable in our model is Salary.
rm(list=ls())
library(rjags)
library(coda)
# Load the data
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
data(Hitters)
Hitters = na.omit(Hitters) # Remove rows with missing Salary
# Some descriptive
summary(Hitters)
## AtBat Hits HmRun Runs
## Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:282.5 1st Qu.: 71.5 1st Qu.: 5.00 1st Qu.: 33.50
## Median :413.0 Median :103.0 Median : 9.00 Median : 52.00
## Mean :403.6 Mean :107.8 Mean :11.62 Mean : 54.75
## 3rd Qu.:526.0 3rd Qu.:141.5 3rd Qu.:18.00 3rd Qu.: 73.00
## Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 30.00 1st Qu.: 23.00 1st Qu.: 4.000 1st Qu.: 842.5
## Median : 47.00 Median : 37.00 Median : 6.000 Median : 1931.0
## Mean : 51.49 Mean : 41.11 Mean : 7.312 Mean : 2657.5
## 3rd Qu.: 71.00 3rd Qu.: 57.00 3rd Qu.:10.000 3rd Qu.: 3890.5
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 2.0 Min. : 3.0
## 1st Qu.: 212.0 1st Qu.: 15.00 1st Qu.: 105.5 1st Qu.: 95.0
## Median : 516.0 Median : 40.00 Median : 250.0 Median : 230.0
## Mean : 722.2 Mean : 69.24 Mean : 361.2 Mean : 330.4
## 3rd Qu.:1054.0 3rd Qu.: 92.50 3rd Qu.: 497.5 3rd Qu.: 424.5
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.0
## CWalks League Division PutOuts Assists
## Min. : 1.0 A:139 E:129 Min. : 0.0 Min. : 0.0
## 1st Qu.: 71.0 N:124 W:134 1st Qu.: 113.5 1st Qu.: 8.0
## Median : 174.0 Median : 224.0 Median : 45.0
## Mean : 260.3 Mean : 290.7 Mean :118.8
## 3rd Qu.: 328.5 3rd Qu.: 322.5 3rd Qu.:192.0
## Max. :1566.0 Max. :1377.0 Max. :492.0
## Errors Salary NewLeague
## Min. : 0.000 Min. : 67.5 A:141
## 1st Qu.: 3.000 1st Qu.: 190.0 N:122
## Median : 7.000 Median : 425.0
## Mean : 8.593 Mean : 535.9
## 3rd Qu.:13.000 3rd Qu.: 750.0
## Max. :32.000 Max. :2460.0
As the dataset includes many variables (20), we are interested in excluding some of them and focusing on the most important determinants. The following model is designed to achieve this goal in high-dimensional settings.
Consider the Bayesian Least Absolute Shrinkage and Selection Operator (LASSO) model,
\[ \begin{split} & y_i |\alpha,\beta_j,\sigma^2 \sim \mathscr{N}(\alpha+ X_i \beta,\sigma^2) \\ & \alpha \sim \mathscr{N}(0,100) \\ & \beta_j|\sigma,\lambda \sim \mathscr{DE}(0,\sigma/\lambda)\\ & \sigma^{2} \sim \mathscr{IG}(0.01,0.01) \\ & \lambda \sim \mathscr{Ga}(0.01,0.01) \\ \end{split} \] where \(\mathscr{DE}(0,\sigma/\lambda)\) denotes a double-exponential (Laplace) distribution with location 0 and scale \(\sigma/\lambda\).
As the Laplace distribution is sharply peaked at zero, this prior assigns a higher probability density to coefficient values very close to zero compared to, say, a Gaussian prior. This “pulls”, or “shrinks”, many coefficients towards zero. In addition, the heavy tails still allow for large coefficients, preventing over-shrinkage of truly important predictors.
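To visualize this, one can compare the two densities at equal variance (a quick sketch; the Laplace density is coded by hand since base R does not provide it):
# A Laplace with scale b has variance 2*b^2, so b = 1/sqrt(2) matches a N(0,1):
dlaplace = function(x, b) exp(-abs(x)/b)/(2*b)
curve(dnorm(x), -4, 4, ylab = "Density", main = "Gaussian vs Laplace prior")
curve(dlaplace(x, 1/sqrt(2)), add = TRUE, lty = 2)
legend("topright", c("Gaussian", "Laplace"), lty = 1:2, bty = "n")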
This is also a hierarchical model: similarly to \(\sigma_b^2\) in the Gaussian hierarchical model, a layer is added through the prior on \(\lambda\). The key idea is that the penalization should be relative to the noise level \(\sigma^2\).
For more information, take a look at the original paper and the JAGS example at this link.
# Model settings:
Y = as.numeric(scale(Hitters$Salary))
X = scale(model.matrix(Salary ~ ., data = Hitters)[, -1]) # Remove intercept column and "dummify"
p = ncol(X)
n = nrow(X)
# Define the list:
dataList = list(Y=Y,X=X,n=n,p=p)
# Implement the JAGS code for this model
model_string = textConnection("model{
# Likelihood
for(i in 1:n){
Y[i] ~ dnorm(mu[i], inv.sigma2)
mu[i] = alpha + inprod(X[i,], beta[])
}
# Prior for alpha
alpha ~ dnorm(0, 0.01)
# Prior for beta
for(j in 1:p){
beta[j] ~ ddexp(0, inv.sigma/inv.lambda)  # rate lambda/sigma, i.e. scale sigma/lambda
}
# Prior for inv.sigma2 and lambda
inv.sigma2 ~ dgamma(0.01, 0.01)
lambda ~ dgamma(0.01, 0.01)
inv.lambda = 1/lambda
inv.sigma = sqrt(inv.sigma2)
sigma = 1/sqrt(inv.sigma2)
}")
We are ready to initialize JAGS and run it.
# Set the options:
burn = 5000
n.iter = 10000
n.adapt = 100
thin = 10
n.chains = 1
# Create a `jags` object:
model = jags.model(model_string,data=dataList,n.chains=n.chains,n.adapt=n.adapt)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 263
## Unobserved stochastic nodes: 22
## Total graph size: 6081
##
## Initializing model
# Run the burn-in iterations:
update(model,n.iter=burn)
# Save the output according to the arguments:
samples = coda.samples(model,variable.names=c("alpha", "beta", "sigma"),thin=thin,n.iter=n.iter)
Again, we look at the trace plots…
# Traceplot and posterior density:
samples_idx = sample(0:9,3)
for(k in samples_idx){ plot(samples[,(2*k+1):(2*(k+1))]) }
and check the autocorrelation function.
# Autocorrelation function:
for(k in samples_idx){ autocorr.plot(samples[,(2*k+1):(2*(k+1))]) }
As a decision rule, we can compute credible intervals at the 95% level and check whether each interval encompasses zero, that is, whether zero lies within the interval. If it does not, we conclude that there exists a significant relationship between hitters’ salaries and that variable.
# Compute posterior summaries
CI = apply(samples[[1]], 2, function(x) {
quantile(x, probs = c(0.025, 0.5, 0.975))
})
# Transpose and convert to dataframe
CI = as.data.frame(t(CI))
# Rename columns for clarity
colnames(CI) = c("Lower_2.5%", "Median", "Upper_97.5%")
# Print the result
print(CI)
## Lower_2.5% Median Upper_97.5%
## alpha -0.089765838 5.851031e-04 0.09373082
## beta[1] -0.679979350 -2.086375e-01 0.04756436
## beta[2] 0.013751059 3.049512e-01 0.75624587
## beta[3] -0.144644373 -6.236473e-03 0.11954337
## beta[4] -0.176370248 2.658891e-02 0.25928993
## beta[5] -0.144760327 2.226342e-02 0.20367549
## beta[6] 0.004955541 1.453241e-01 0.31641106
## beta[7] -0.275119422 -6.311935e-02 0.09179468
## beta[8] -0.361070491 -1.559645e-02 0.25664489
## beta[9] -0.188031859 1.580108e-01 0.54519089
## beta[10] -0.086188754 8.854842e-02 0.32211891
## beta[11] -0.073275256 1.903019e-01 0.75814005
## beta[12] -0.075232662 1.735955e-01 0.56364428
## beta[13] -0.443175524 -1.066464e-01 0.07158210
## beta[14] -0.077543371 3.492686e-02 0.17883030
## beta[15] -0.211240289 -1.248737e-01 -0.04328273
## beta[16] 0.058997391 1.550846e-01 0.25146716
## beta[17] -0.064029292 3.103612e-02 0.16275668
## beta[18] -0.148413860 -3.036051e-02 0.06762054
## beta[19] -0.140946515 -8.117533e-05 0.10451808
## sigma 0.657814224 7.151444e-01 0.78045224
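From the table above, a quick way to flag such variables is to keep those whose interval excludes zero (a sketch based on the CI object just computed):
# Variables whose 95% credible interval does not contain zero:
rownames(CI)[CI[,"Lower_2.5%"] > 0 | CI[,"Upper_97.5%"] < 0]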
The ggs_caterpillar function from the package
ggmcmc could also be used to display the credible intervals
for a class of parameters of interest, e.g. \(\beta_1,\ldots,\beta_p\). By default, it
plots credible intervals with thin lines at the 95%
level and credible intervals with thick lines at the
90% level.
Warning: it is not clear why the horizontal axis is labelled HPD (highest posterior density), as equal-tailed credible intervals are in fact displayed.
# Convert the results from JAGS into a `ggmcmc` object
library(ggmcmc)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
beta_labels = plab("beta", list(Variable = colnames(Hitters)[c(1:18, 20)]))
ggmcmc_object <- ggs(samples, par_labels = beta_labels, family = "beta")
# Create a "caterpillar plot"
ggs_caterpillar(D = ggmcmc_object, thick_ci = c(0.05, 0.95), thin_ci = c(0.025, 0.975))
Remarks:
The function update performs the burn-in iterations for the Markov chain. After the burn-in period the chain is supposed to have converged to its stationary distribution; convergence diagnostics, such as the trace plot, have to be inspected to verify this.
The adaptive iterations (see n.adapt) that JAGS performs are neither burn-in iterations nor samples to be used for posterior inference. The documentation reports that “the sequence of samples generated during this adaptive phase is not a Markov chain, and therefore may not be used for posterior inference on the model”. The adaptive phase is only supposed to calibrate the samplers ex ante to improve the efficiency of the sampling.
Consider the Bayesian linear regression model with Normal – Inverse Gamma priors (see Bayesian Core, Chapter 3, p. 54):
\[ \begin{align*} & y_i \sim \mathscr{N}(\alpha+ X_i \beta,\sigma^2), \\ & \alpha \sim \mathscr{N}(0,100), \\ & \beta_j \sim \mathscr{N}(0,100), \\ & \sigma^2 \sim \mathscr{IG}(0.01,0.01). \end{align*} \]
For comparison, in the lab 04-Regression (Example
1) we introduced a Bayesian linear regression model where the
variance \(\sigma^2\) was assumed to be
known.
For this model it is possible to derive the marginal posteriors for \(\beta\) and \(\sigma^2\) analytically, but in this exercise we will use JAGS.
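A minimal model-string sketch for this specification (assuming a data list with Y, X, n and p as in the Boston housing example):
# Normal - Inverse Gamma priors: fixed prior variance 100 for alpha and each beta[j]
model_string = textConnection("model{
  for(i in 1:n){
    Y[i] ~ dnorm(alpha + inprod(X[i,], beta[]), inv.var)
  }
  alpha ~ dnorm(0, 0.01)        # precision 0.01, i.e. variance 100
  for(j in 1:p){
    beta[j] ~ dnorm(0, 0.01)    # fixed variance: no hierarchical layer
  }
  inv.var ~ dgamma(0.01, 0.01)  # sigma^2 ~ IG(0.01, 0.01)
  sigma2 = 1/inv.var
}")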
A generalized linear model (GLM) is a generalization of the ordinary linear regression model in which a link function relates the response to a number of independent variables. For example, logistic, probit, and Poisson regression belong to this broad class.
In particular, logistic regression commonly refers to a model for dichotomous data, that is when the dependent variable \(y\) only takes values equal to 0 or 1. Strictly speaking, however, the name covers a broader class of models in which the logistic function is used as the link function, e.g. binomial and Bernoulli regression.
In particular, for binary variables the standard linear model could predict values above 1 or below 0. Thus we employ the logistic, or logit, link function to bound the response between 0 and 1,
\[ \begin{align*} z_i &= x_i\beta, \\ p_i &= \frac{1}{1+e^{-z_i}}, \end{align*} \] where \(z_i = \log\left(\frac{p_i}{1-p_i}\right)\) is also known as the log-odds and \(p_i\in(0,1)\) is the probability of \(y_i\) being equal to 1 for the \(i\)-th observation, computed as the inverse-logit transformation of \(z_i\).
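As a quick numerical illustration of the link and its inverse (plain R):
# The logit maps (0,1) to the real line; its inverse maps back:
logit = function(p) log(p/(1-p))
inv_logit = function(z) 1/(1+exp(-z))
inv_logit(logit(0.3))  # returns 0.3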
We consider the binomial regression model,
\[ \begin{align*} & y_i \mid p_i \overset{ind}\sim \mathscr{B}in(p_i,k), \\ & \log\left(\frac{p_i}{1-p_i}\right) = x_i \beta, &\qquad i=1,\ldots,n \\[15pt] &\beta_j \overset{iid}\sim \mathscr{N}(0,100), &\qquad j=1,\ldots,p \end{align*} \]
where \(k\) denotes the number of trials, taken as fixed. To lighten the notation we write \(\text{logit}(p_i) = x_i \beta\), which is also how the model is specified in the BUGS syntax.
We consider real mortality data to model the count of deaths among moths depending on the sex and on the dose of a poisonous substance. The data are taken from Royle and Dorazio, Chapter 2, which is available online. In our model we are interested in a sequence of \(K = 20\) trials, thus we write the code accordingly.
# Clear the workspace:
rm(list=ls())
# Set the seed:
set.seed(1)
# Load the packages:
library(rjags)
library(coda)
# Create the dependent variable:
Y = c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
# Create the independent variables:
sex = c(rep("male", 6), rep("female", 6))
dose = rep(0:5, 2)
X = cbind(dose,sex=as.integer(sex=="male"))
# Settings
n = length(Y)
P = ncol(X)
k = 20
burn = 5000
n.iter = 2000
n.chains = 3
n.adapt = 100
thin = 2
We write the BUGS code, as done previously, for the model above.
model_string = textConnection(
"model {
# Likelihood
for (i in 1:n) {
Y[i] ~ dbin(p[i],k)
logit(p[i]) = beta0 + X[i,] %*% beta  # equivalent to inprod(X[i,], beta)
}
# Prior
beta0 ~ dnorm(0,0.001)
for (j in 1:P) {
beta[j] ~ dnorm(0,0.01)
}
}")
We then fit the model. Notice that n.chains = 3, hence we will produce three different sequences of draws. This approach is useful when the user wants to initialize the sampler with different values, so as to better explore every region of the parameter support.
# Set up the data
dataList = list(Y=Y,X=X,n=n,P=P,k=k)
# Compile the model
model = jags.model(model_string,data=dataList,n.chains=n.chains,n.adapt=n.adapt)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 12
## Unobserved stochastic nodes: 3
## Total graph size: 94
##
## Initializing model
# Model burn-in phase
update(model,n.iter=burn)
# Sampling from the posterior
samples = coda.samples(model,variable.names=c("beta0","beta","p"),n.iter=n.iter,thin=thin)
The usual trace plots and densities are shown below.
# Plot the samples:
samples_idx = sample(4:15,2)
plot(samples[,samples_idx])
We compute the posterior mean and credible intervals for the intercept and the two other coefficients. As shown below, this is quite easy. Notice that such metrics are computed from all the chains combined, not from a single one. Conversely, if one wants to produce them separately for each chain, one can access e.g. the first chain as samples[[1]].
# Compute posterior mean and 95% CI
postMean = colMeans(as.matrix(samples))
CI = apply(as.matrix(samples),2,quantile,c(0.025, 0.975))
rbind("Mean"=postMean,CI)
## beta[1] beta[2] beta0 p[1] p[2] p[3] p[4]
## Mean 1.0914932 1.1248266 -3.557633 0.08554507 0.2115811 0.4387914 0.6960784
## 2.5% 0.8426731 0.4355807 -4.513067 0.03789676 0.1237691 0.3244475 0.5867870
## 97.5% 1.3576927 1.8439755 -2.681534 0.15545710 0.3155559 0.5536759 0.7966544
## p[5] p[6] p[7] p[8] p[9] p[10] p[11]
## Mean 0.8692001 0.9499479 0.03056927 0.08223332 0.2055267 0.4306827 0.6886711
## 2.5% 0.7891950 0.9013087 0.01084585 0.03980551 0.1268471 0.3161823 0.5674043
## 97.5% 0.9313707 0.9803280 0.06407182 0.14369805 0.2998579 0.5490868 0.7971541
## p[12]
## Mean 0.8644984
## 2.5% 0.7714625
## 97.5% 0.9335157
# Alternatively
#summary(samples)
As a matter of fact, we can plot the density obtained from each chain separately as follows.
# Set a color palette:
colorsPalette = rainbow(n.chains)  # note: palette(rainbow(n.chains)) would return the previous palette
# Plot the chain-specific posterior densities:
plot(density(unlist(samples[[1]][,"beta[2]"])),col=colorsPalette[1],
main="Chain-specific posterior density for beta[2]")
for (j in 2:n.chains) {
lines(density(unlist(samples[[j]][,"beta[2]"])),col=colorsPalette[j])
}
What are the effects of dose and sex on the deaths? Concerning sex, beta[2] is always positive, hence being male increases the probability of death compared to being female. Concerning dose, we can help ourselves with the following graph, where we display the expected number of deaths out of \(K=20\) trials.
# Set some variables:
orderDose = order(X[,1])
colorSex = ifelse(sex=='male','blue','red')
# Plot actual VS predicted number of deaths:
plot(X[,1],Y,pch=19,col=colorSex,xlab="Dose",ylab="Death count")
points(X[orderDose,1],
k * boot::inv.logit(postMean[3]+postMean[2]*1+postMean[1]*X[orderDose,1]),
col="blue",pch=2) # `inv.logit()` from the `boot` package
points(X[orderDose,1],
k * boot::inv.logit(postMean[3]+postMean[2]*0+postMean[1]*X[orderDose,1]),
col="red",pch=6)
legend("topleft",legend=c("Males","Females"),lty=1,lwd=2,col=c("blue","red"),bty="n")
In particular, we compare the observed deaths with the model-based predictions of the expected deaths for males and females across different doses. Recall that if \(Y\sim\mathscr{B}in(p,k)\), then \(E(Y)=kp\). Therefore k * inv.logit(beta0 + beta2*sex + beta1*dose) computes the expected number of deaths.
By inspecting the graph we conclude that both males and females die more as the dose increases, but males display a higher mortality at the same dose, even when no dose is given.
Consider the Bernoulli regression model,
\[ \begin{align*} & y_i \mid p_i \overset{ind}\sim \mathscr{B}e(p_i), &\qquad i=1,\ldots,n \\[15pt] 1) \quad & \log\left(\frac{p_i}{1-p_i}\right) = x_i \beta, &\qquad \rightarrow\text{logistic link} \\ 2) \quad & \Phi^{-1}(p_i) = x_i \beta, &\qquad \rightarrow\text{probit link} \\[15pt] &\beta_j \overset{iid}\sim \mathscr{N}(0,100), &\qquad j=1,\ldots,p \end{align*} \]
for responses \(y_i\) that take values 0 or 1, where either the logistic or the probit link function may be used. Notice that, in the probit case, \(\Phi(\cdot)\) denotes the standard normal c.d.f. and \(\Phi^{-1}(\cdot)\) its inverse, i.e. the quantile function.
We load the titanic dataset about the survival of the passengers of the Titanic; more information is available here.
Most of the variables are not useful for a quantitative analysis, thus we remove them. Afterwards, we also remove the passengers whose age is missing. We would like to set up a model to predict the survival of a passenger, Survived, based on the attributes Pclass (the passenger’s class), Sex and Age.
rm(list=ls())
# Load the packages:
library(rjags)
library(coda)
# Load the data:
library(titanic)
data = titanic_train[,c(2,3,5,6)]
data = data[-which(is.na(data$Age)),]
# Summary statistics:
summary(data)
## Survived Pclass Sex Age
## Min. :0.0000 Min. :1.000 Length:714 Min. : 0.42
## 1st Qu.:0.0000 1st Qu.:1.000 Class :character 1st Qu.:20.12
## Median :0.0000 Median :2.000 Mode :character Median :28.00
## Mean :0.4062 Mean :2.237 Mean :29.70
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:38.00
## Max. :1.0000 Max. :3.000 Max. :80.00
# Training and test set:
TrainData = data[1:(nrow(data)-1),]  # all passengers except the last one
TestData = data[nrow(data),]         # the last passenger
Complete the following tasks:
- fit the Bernoulli regression model on TrainData, choosing one between the specifications 1 and 2 in the formulas above. Hint: in the BUGS syntax the logit link function is called with logit and the probit link function is called with probit;
- predict the survival of the passenger in TestData. A sketch of a possible solution follows the placeholder below.
# Write your code here.
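A minimal sketch of a possible solution with the logit link (the recoding of Sex into an indicator and the prediction step reflect one possible workflow, not the only one):
# Build a numeric design matrix from TrainData:
Xtr = cbind(Pclass = TrainData$Pclass,
            Sex = as.integer(TrainData$Sex == "male"),
            Age = TrainData$Age)
dataList = list(Y = TrainData$Survived, X = Xtr, n = nrow(Xtr), p = ncol(Xtr))
model_string = textConnection("model{
  for(i in 1:n){
    Y[i] ~ dbern(pr[i])
    logit(pr[i]) = beta0 + inprod(X[i,], beta[])  # or: probit(pr[i]) = ...
  }
  beta0 ~ dnorm(0, 0.01)
  for(j in 1:p){ beta[j] ~ dnorm(0, 0.01) }
}")
model = jags.model(model_string, data = dataList, n.adapt = 100)
update(model, n.iter = 5000)
samples = coda.samples(model, variable.names = c("beta0","beta"), n.iter = 10000, thin = 10)
# Posterior-mean prediction for the test passenger:
post = colMeans(as.matrix(samples))
z = post["beta0"] + post["beta[1]"]*TestData$Pclass +
    post["beta[2]"]*as.integer(TestData$Sex == "male") + post["beta[3]"]*TestData$Age
1/(1 + exp(-z))  # predicted survival probability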
Bayesian model selection was introduced in the context of the BAS library. After eliciting a model prior \(\pi(M_j)\), for each model \(j=1,\ldots,M\) we are able to compute its posterior probability \(\pi(M_j \mid \mathbf{y})\).
However, \(M=2^p\) in a linear regression model with \(p\) covariates. Thus, the number of models quickly increases and model complexity, i.e. the number of parameters, increases as well, which may be detrimental for estimation.
In this example we consider a different approach where some coefficients may be shrunk toward zero, so the model performs variable selection. This methodology differs from the hierarchical and Bayesian LASSO models introduced above in that it allows the user to set some coefficients exactly equal to zero.
We index each of the possible \(2^p\) combinations through the vector
\[ \mathbf{\gamma} = \left(\gamma_1,\dots,\gamma_p\right), \]
where \(\gamma_j = 0\) if \(\beta_j = 0\) and \(\gamma_j = 1\) if \(\beta_j \neq 0\), and consider the mixture prior \(\pi(\beta,\gamma) = \pi(\beta\mid\gamma)\pi(\gamma)\), which is specified below.
The linear regression model with spike & slab priors is:
\[ \begin{align*} & \beta_j \mid \gamma_j \overset{ind}\sim (1-\gamma_j)\,\delta_{(0)} + \gamma_j\,\mathscr{N}\left(0, \sigma^2_{\beta_j}\right), \\ & \gamma_j \mid \theta_j \overset{ind}\sim \mathscr{B}e\left(\theta_j\right), \\[5pt] & \theta_{j} \overset{iid}\sim p\left(\theta_j\right), \end{align*} \]
where \(\theta_j\) is the probability that \(\beta_j\) is nonzero, and hence that the corresponding covariate is included in the model, and \(\delta_{(0)}\) is a Dirac delta, i.e. \(\delta_{(0)}(x) = 0\) if \(x\neq 0\) while all of its mass sits at \(x=0\). A common choice for \(\theta_j\) is the uniform distribution over \((0,1)\), i.e. the \(\mathscr{B}eta(1,1)\) distribution.
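To get intuition for this prior, one can simulate from it directly in R (a quick sketch, marginalizing over \(\gamma_j\) with a fixed \(\theta\)):
# Spike-and-slab draws: exact zeros with probability 1 - theta, else a N(0,1) slab
set.seed(1)
theta = 0.3
g = rbinom(1e4, 1, theta)
beta = g * rnorm(1e4)
mean(beta == 0)  # close to 1 - theta = 0.7, the mass of the spike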
In the following example, we consider again the Hitters
dataset and fit a linear regression model with a spike and slab prior on
the coefficients.
rm(list=ls())
library(rjags)
library(coda)
# Load the data
library(ISLR2)
data(Hitters)
Hitters = na.omit(Hitters) # Remove rows with missing Salary
We have already performed a preliminary analysis of this dataset, which involves many variables that are well correlated with Salary, the target variable. The spike and slab prior is thus appealing as it shrinks to zero the predictors that are not relevant.
# Model settings:
Y = as.numeric(scale(Hitters$Salary))
X = model.matrix(Salary ~., data = Hitters)[,-1]
P = ncol(X)
N = nrow(X)
# JAGS inputs:
var_beta=rep(1, P)
dataList = list(Y=Y,X=X,var_beta=var_beta,N=N,P=P)
burn = 5000
n.iter = 15000
n.adapt = 1000
thin = 10
n.chains = 1
The code for this model is quite challenging compared to the previous ones. Firstly, we have to combine the linear regression model with the spike & slab priors; then we need to track the nonzero coefficients across all models. This task is accomplished using a binary encoding of the visited models, which is explained below (see also the decoding sketch after the model code).
model_string = textConnection(
"model {
# Likelihood
for (i in 1:N) {
Y[i] ~ dnorm(beta0+inprod(X[i,],beta[]),inv.var.y)
}
# Model encoding
for (j in 1:P) { # at every iteration, recognise the model
inclusionindex[j] = g[j]*pow(2,j) # given the selected variables
}
mdl = 1 + sum(inclusionindex[])
# Priors
inv.var.y ~ dgamma(0.01,0.01)
beta0 ~ dnorm(0,0.01)
for(j in 1:P) {
inv_var_beta[j] = 1/var_beta[j]
}
for(j in 1:P) {
g[j] ~ dbern(theta[j])
theta[j] ~ dunif(0,1)
betaTemp[j] ~ dnorm(0,inv_var_beta[j])
beta[j] = g[j]*betaTemp[j]
}
}")
This time, we also specify the starting values for the parameters in
inits, then we sample.
# Specify the initial values:
inits = list(beta0=0,betaTemp=rep(0,P),g=rep(0,P),theta=rep(0.5,P))
# Initialization and burn-in step:
model = jags.model(model_string,data=dataList,n.adapt=n.adapt,inits=inits,n.chains=n.chains)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 263
## Unobserved stochastic nodes: 59
## Total graph size: 6212
##
## Initializing model
update(model,n.iter=burn)
# Produce the samples
param = c("beta0","beta","mdl")
samples = coda.samples(model=model,variable.names=param,n.iter=n.iter,thin=thin)
We view the results through the trace plots and the approximated densities. If we produced n.chains > 1 sequences of samples, we might notice that some sequences have different trajectories when initialized from different values.
# `beta` traceplots and posteriors
samples_idx = sample(1:12,12)
par(mfrow=c(3,3))
plot(samples[,samples_idx])
In the next example we consider a method for selecting the best model among those that were estimated. In fact, for each sample the algorithm may have set the value of a parameter equal to zero, or nonzero. Hence, a possibly different model, depending on which variables are active, is sampled using the spike and slab prior.
In the following application we consider the reach
dataset:
Rheumatoid arthritis is an autoimmune disease characterized by chronic synovial inflammation and destruction of cartilage and bone in the joints. The Rotterdam Early Arthritis Cohort (REACH) study was initiated in 2004 to investigate the development of rheumatoid arthritis in patients with early manifestations of joint impairment. Information regarding basic patient characteristics, serological measurements, and patterns of disease involvement at baseline has been gathered in 681 recruited patients.
In particular, we aim to understand which of 12 baseline factors are potentially associated with the development of rheumatoid arthritis.
Because the response variable is binary, i.e. it indicates whether or not the patient has developed the disease, we need to use a logit or probit link function, as introduced in Exercise 4.
rm(list=ls())
library(rjags)
library(coda)
# Import the data
reach = read.table("reach.txt", header=T)
reach$gender = ifelse(reach$gender == 1,0,1)
# Define the inputs
p = 12
N = nrow(reach)
X = reach[1:N,1:12]
Y = reach[1:N,13]
var_beta=rep(1, p)
dataList = list(Y=Y,X=X,var_beta=var_beta,N=N,p=p)
# Input for JAGS:
burn = 5000
n.iter = 10000
thin = 10
Firstly, we have to combine the probit model with the spike & slab priors; then, as before, we track the nonzero coefficients across models using the binary encoding introduced above.
model_string = textConnection(
"model {
# Likelihood
for (i in 1:N) {
Z[i] = phi( beta0 + inprod(X[i,],beta) )
Y[i] ~ dbern(Z[i])
}
# Model encoding
for (j in 1:p) {
inclusionindex[j] = g[j]*pow(2,j)
}
mdl = 1 + sum(inclusionindex[])
# Priors
beta0 ~ dnorm(0,0.01)
for(j in 1:p) {
inv_var_beta[j] = 1/var_beta[j]
}
for(j in 1:p) {
g[j] ~ dbern(theta[j])
theta[j] ~ dunif(0,1)
betaTemp[j] ~ dnorm(0,inv_var_beta[j])
beta[j] = g[j]*betaTemp[j]
}
}")
Specify the starting values for the parameters in inits
and sample.
# Initial values
inits = list(beta0=0,betaTemp=rep(0,p),g=rep(0,p),theta=rep(0.5,p))
model = jags.model(model_string,data=dataList,n.adapt=100,inits=inits)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 681
## Unobserved stochastic nodes: 37
## Total graph size: 11683
##
## Initializing model
update(model,n.iter=burn)
# Which parameters to store?
param = c("beta0","beta","g","mdl")
samples = coda.samples(model=model,variable.names=param,n.iter=n.iter,thin=thin)
# Trace plot and posteriors
par(mfrow=c(3,3))
plot(samples[,1:12])
We immediately notice that some coefficients are zero, e.g. beta[2] and beta[5], while some are not, like beta[1] and beta[8]. However, it is possible that a coefficient is sampled as zero most of the time but not always; see for example beta[9], which displays slight bimodality.
How do we select a variable?
For example, the median probability model (MPM) picks the variables with an estimated posterior inclusion probability higher than \(0.5\). The posterior probability of inclusion corresponds to the posterior mean of the \(\gamma_j\), which are called g in the code. In practice, we use a threshold to discriminate among the variables.
# Compute the posterior mean for g[j]:
gsamples = samples[[1]][,14:25]
post_mean_g = apply(gsamples, 2, "mean")
# Print the probabilities of inclusion:
post_mean_g
## g[1] g[2] g[3] g[4] g[5] g[6] g[7] g[8] g[9] g[10] g[11] g[12]
## 1.000 0.064 1.000 0.359 0.184 0.560 0.110 1.000 0.562 0.075 0.997 0.186
We can visualize them with a nice chart.
# Load the library:
library(ggplot2)
postmeang_df = data.frame(value=post_mean_g,var=colnames(X))
p1 = ggplot(data=postmeang_df,aes(y=value,x=var,fill=var)) +
geom_bar(stat="identity") +
geom_hline(mapping=aes(yintercept=0.5),col=2,lwd=1.1) +
coord_flip() + theme_minimal() + theme(legend.position="none") +
ylab("Posterior inclusion probabilities") + xlab("")
p1
We select the variables such that the posterior probability of inclusion is greater than \(0.5\).
# Select best model according to MPM
mp_SpSl = which(post_mean_g>0.5)
post_mean_g[mp_SpSl]
## g[1] g[3] g[6] g[8] g[9] g[11]
## 1.000 1.000 0.560 1.000 0.562 0.997
Alternatively, we may use the highest posterior density (HPD) model: the model with the highest posterior probability is chosen. Recall that we have kept track of the models using the binary encoding in mdl, that is
\[ \text{mdl} = 1 + \sum_{j=1}^{p} \gamma_j\, 2^{j}. \]
The chain corresponding to mdl is stored and can be viewed. We are interested in the model that is visited most often, i.e. the HPD model.
# Trace plot and density for `mdl`:
plot(samples[,"mdl"])
How many unique models have been visited throughout the chain?
# Get the number of unique models:
length(table(samples[,"mdl"]))
## [1] 113
What is the observed frequency, i.e. the posterior probability, of each model? We can plot the frequencies with a barplot where the model encoding is reported on the x-axis.
# Post frequency of visited models
visited_models = sort(table(samples[,"mdl"]),decreasing=TRUE)
barplot(visited_models/sum(table(samples[,"mdl"])),
xlab="Model encoding",ylab="Posterior probability")
With the following code, we select the HPD model:
# Get the unique profiles and sort the results
unique_model = unique(as.matrix(gsamples),MARGIN=1)
freq = apply(unique_model,1,function(b) sum(apply(gsamples,1,function(a) all(a == b))))
# View the unique models and their frequency:
#cbind(unique_model[order(freq,decreasing = T),], sort(freq,decreasing = T))
# Select the HPD model and print the variable names:
colnames(X)[as.logical(unique_model[which.max(freq),])]
## [1] "accp" "esr" "rf" "sym" "SJC" "bcp_hands"
HPDmodel = c(1:12)[as.logical(unique_model[which.max(freq),])]
# Display the variables' numbers:
HPDmodel
## [1] 1 3 6 8 9 11